From Wikipedia, the free encyclopedia:
Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence developed by Heng Li and Bob Handsaker et al.[1] It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT’s PSL. The name of SAM came from Gabor Marth from University of Utah, who originally had a format under the same name but with a different syntax more similar to a BLAST output.[2] It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences.[3] The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project.
From Wikipedia, the free encyclopedia:
SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map) and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion.[3] SAM files can be very large (10s of Gigabytes is common), so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.
The first thing to do is create a directory to store all the tutorial output files and data:
mkdir tutorial
Create a new environment called ‘alignment’ with all the required software installed:
conda create --yes --name alignment bowtie2 bwa entrez-direct samtools seqkit sra-tools=2.11.0=pl5262h37d2149_1 # only this version works currently (21/09/2022)
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
## Collecting package metadata (repodata.json): ...working... done
## Solving environment: ...working... done
##
## ## Package Plan ##
##
## environment location: /opt/miniconda3/envs/alignment
##
## added / updated specs:
## - bowtie2
## - bwa
## - entrez-direct
## - samtools
## - seqkit
## - sra-tools==2.11.0=pl5262h37d2149_1
##
##
## The following NEW packages will be INSTALLED:
##
## bowtie2 bioconda/osx-64::bowtie2-2.4.5-py310h7d4de36_4
## bwa bioconda/osx-64::bwa-0.7.17-h1f540d2_9
## bzip2 conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
## c-ares conda-forge/osx-64::c-ares-1.18.1-h0d85af4_0
## ca-certificates conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
## curl conda-forge/osx-64::curl-7.83.1-h23f1065_0
## entrez-direct bioconda/osx-64::entrez-direct-16.2-h193322a_1
## gettext conda-forge/osx-64::gettext-0.19.8.1-hd1a6beb_1008
## hdf5 conda-forge/osx-64::hdf5-1.10.6-nompi_haae91d6_101
## htslib bioconda/osx-64::htslib-1.16-h567f53e_0
## icu conda-forge/osx-64::icu-70.1-h96cf925_0
## krb5 conda-forge/osx-64::krb5-1.19.3-hb98e516_0
## libcurl conda-forge/osx-64::libcurl-7.83.1-h23f1065_0
## libcxx conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
## libdeflate conda-forge/osx-64::libdeflate-1.13-h775f41a_0
## libedit conda-forge/osx-64::libedit-3.1.20191231-h0678c8f_2
## libev conda-forge/osx-64::libev-4.33-haf1e3a3_1
## libffi conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
## libgfortran conda-forge/osx-64::libgfortran-4.0.0-7_5_0_h1a10cd1_23
## libgfortran4 conda-forge/osx-64::libgfortran4-7.5.0-h1a10cd1_23
## libiconv conda-forge/osx-64::libiconv-1.16-haf1e3a3_0
## libidn2 conda-forge/osx-64::libidn2-2.3.3-hac89ed1_0
## libnghttp2 conda-forge/osx-64::libnghttp2-1.47.0-h5aae05b_1
## libsqlite conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
## libssh2 conda-forge/osx-64::libssh2-1.10.0-h47af595_3
## libunistring conda-forge/osx-64::libunistring-0.9.10-h0d85af4_0
## libxml2 conda-forge/osx-64::libxml2-2.9.14-hea49891_4
## libzlib conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
## llvm-openmp conda-forge/osx-64::llvm-openmp-14.0.4-ha654fa7_0
## ncbi-ngs-sdk bioconda/osx-64::ncbi-ngs-sdk-2.11.2-h247ad82_0
## ncurses conda-forge/osx-64::ncurses-6.3-h96cf925_1
## openssl conda-forge/osx-64::openssl-3.0.5-hfd90126_2
## ossuuid conda-forge/osx-64::ossuuid-1.6.2-h0a44026_1000
## perl conda-forge/osx-64::perl-5.26.2-hbcb3906_1008
## perl-uri bioconda/osx-64::perl-uri-1.71-pl526_3
## perl-xml-libxml bioconda/osx-64::perl-xml-libxml-2.0132-pl526h08abf6f_1
## perl-xml-namespac~ bioconda/osx-64::perl-xml-namespacesupport-1.11-pl526_1
## perl-xml-sax bioconda/osx-64::perl-xml-sax-0.99-pl526_1
## perl-xml-sax-base bioconda/osx-64::perl-xml-sax-base-1.09-pl526_0
## pip conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
## python conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
## python_abi conda-forge/osx-64::python_abi-3.10-2_cp310
## readline conda-forge/osx-64::readline-8.1.2-h3899abd_0
## samtools bioconda/osx-64::samtools-1.15.1-h7e39424_1
## seqkit bioconda/osx-64::seqkit-2.3.1-h527b516_0
## setuptools conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
## sra-tools bioconda/osx-64::sra-tools-2.11.0-pl5262h37d2149_1
## tbb conda-forge/osx-64::tbb-2021.5.0-hb8565cd_5
## tk conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
## tzdata conda-forge/noarch::tzdata-2022c-h191b570_0
## wget conda-forge/osx-64::wget-1.20.3-hd3787cc_1
## wheel conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
## xz conda-forge/osx-64::xz-5.2.6-h775f41a_0
## zlib conda-forge/osx-64::zlib-1.2.12-hfd90126_3
## zstd conda-forge/osx-64::zstd-1.5.2-hfa58983_4
##
##
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## # $ conda activate alignment
## #
## # To deactivate an active environment, use
## #
## # $ conda deactivate
##
## Retrieving notices: ...working... done
Activate the ‘alignment’ environment:
conda activate alignment
Test that Bowtie 2 and BWA are available:
which samtools
## /opt/miniconda3/envs/alignment/bin/samtools
Download the Ebola virus reference genome:
efetch -db nuccore -id AF086833 -format fasta > tutorial/AF086833.fasta
Download 1,000 sequencing reads from public data:
fastq-dump -X 1000 -O tutorial --split-files SRR1972739
## Read 1000 spots for SRR1972739
## Written 1000 spots for SRR1972739
Index the reference genome:
bwa index -p tutorial/AF086833 tutorial/AF086833.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index -p tutorial/AF086833 tutorial/AF086833.fasta
## [main] Real time: 0.005 sec; CPU: 0.023 sec
Align reads using BWA to the reference genome:
bwa mem tutorial/AF086833 tutorial/SRR1972739_1.fastq tutorial/SRR1972739_2.fastq > tutorial/SRR1972739.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 2000 sequences (202000 bp)...
## [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (63, 590, 2, 68)
## [M::mem_pestat] analyzing insert size distribution for orientation FF...
## [M::mem_pestat] (25, 50, 75) percentile: (85, 119, 183)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 379)
## [M::mem_pestat] mean and std.dev: (135.27, 62.49)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 477)
## [M::mem_pestat] analyzing insert size distribution for orientation FR...
## [M::mem_pestat] (25, 50, 75) percentile: (176, 219, 275)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 473)
## [M::mem_pestat] mean and std.dev: (227.64, 75.24)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 572)
## [M::mem_pestat] skip orientation RF as there are not enough pairs
## [M::mem_pestat] analyzing insert size distribution for orientation RR...
## [M::mem_pestat] (25, 50, 75) percentile: (71, 147, 192)
## [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 434)
## [M::mem_pestat] mean and std.dev: (149.31, 89.10)
## [M::mem_pestat] low and high boundaries for proper pairs: (1, 555)
## [M::mem_process_seqs] Processed 2000 reads in 0.107 CPU sec, 0.107 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem tutorial/AF086833 tutorial/SRR1972739_1.fastq tutorial/SRR1972739_2.fastq
## [main] Real time: 0.110 sec; CPU: 0.114 sec
The SAM format is used to store output from read alignment software. It is a TAB-delimited text format consisting of a header and alignment section. To look at each of these sections, we can use the view command.
Print the SAM header section:
samtools view -H tutorial/SRR1972739.sam
## @SQ SN:AF086833.2 LN:18959
## @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem tutorial/AF086833 tutorial/SRR1972739_1.fastq tutorial/SRR1972739_2.fastq
## @PG ID:samtools PN:samtools PP:bwa VN:1.15.1 CL:samtools view -H tutorial/SRR1972739.sam
The header section primarily contains metadata about the file and reference sequence. For example, whether the alignments in the file have been sorted. Refer to the SAM specification for full details.
Print the SAM alignment section:
# Output the first few lines
samtools view tutorial/SRR1972739.sam | head
## SRR1972739.1 83 AF086833.2 15684 60 69M32S = 15600 -153 TTTAGATTTAACAAGATACCGAGAAAATGAATTGATTTATGACAATAATCCTCTAAAAGGAGGACTCAAATGAGATATTGCAATTGAGTCCTCCTTTTAGA DDDDDEEEEEDEEFFFEDHHHHIJJJJJJJJJJJJJJJJJIGIJJJJJJJJJJJJJJJJJJJIGIGJJJJJJJJJJJJJJIJJJJJJIFHHHHFFFFFCCC NM:i:2 MD:Z:27C16G24 MC:Z:101M AS:i:59 XS:i:0 SA:Z:AF086833.2,15735,+,33M68S,60,0;
## SRR1972739.1 2115 AF086833.2 15735 60 33M68H = 15600 -136 TCTAAAAGGAGGACTCAATTGCAATATCTCATT CCCFFFFFHHHHFIJJJJJJIJJJJJJJJJJJJ NM:i:0 MD:Z:33 MC:Z:101M AS:i:33 XS:i:0 SA:Z:AF086833.2,15684,-,69M32S,60,2;
## SRR1972739.1 163 AF086833.2 15600 60 101M = 15684 153 GTATAATCGTGCTCACCTTCATCTAACTAAGTGTTGCACCCGGGAGGTACCAGCTCAGTACTTAACATACACATCTACATTGGATTTAGATTTAACAAGAT @BBFFFFFHHHHHJJJJJJJJJJJJJIIIJJGHIIJJJJJJJJIJJJGIJGIJJIJJJIJHHHHHHGFFFFFEEEEEEEDFDDDDDDDDDDEDDEDDDDDD NM:i:3 MD:Z:0A44A14T40 MC:Z:69M32S AS:i:90 XS:i:0
## SRR1972739.2 115 AF086833.2 4919 60 101M = 4863 -104 CAGGCTCCTGCGAATTGGAAACCAGGCTTTCCTCCAGGAGTTCGTTCTTCCACCAGTCCAACTACCCCAGTATTTCACCTTTGATTTGACAGCACTCAAAC DDDDDDDDDDDDDDDEDDDDCDDEEEEEFFFHHHHHIIIJJJJJJJJJIJHGJJJJIJJIIIHFJJJJJJJJJIHGFJJJJJJJJJJIHHHGHFFFFFBBB NM:i:1 MD:Z:51G49 MC:Z:47S54M AS:i:96 XS:i:0
## SRR1972739.2 179 AF086833.2 4863 60 47S54M = 4919 104 TGGTTTCCAATTCGCAGGAGCCTGAGGGGGTGATCCGGGATTCCAGGACCAATCCGCTTGTCAGAGTCAATCGGCTGGGTCCTGGAATCCCGGATCACCCC BDDDDDDDDDDDDCDDDBDDDDDDDDBDDDDDDDDBDDDEDEECEFFDFFHHGJJIFJJJJJIJJJIJJJHIJJJJJJIHFBJJIHIJHGHHHEDD?F@@@ NM:i:2 MD:Z:8A41T3 MC:Z:101M AS:i:45 XS:i:0 SA:Z:AF086833.2,4893,+,51S50M,60,1;
## SRR1972739.2 2211 AF086833.2 4893 60 51H50M = 4919 127 GGTCCTGGAATCCCGGATCACCCCCTCAGGCTCCTGCGAATTGGAAACCA FFDFFECEEDEDDDBDDDDDDDDBDDDDDDDDBDDDCDDDDDDDDDDDDB NM:i:1 MD:Z:20T29 MC:Z:101M AS:i:45 XS:i:0 SA:Z:AF086833.2,4863,-,47S54M,60,2;
## SRR1972739.3 67 AF086833.2 2531 60 60M41S = 2587 57 TATCCGGACTCCCTTGAAGAGGAATATCCACCATGGCTCACTGAAAAAGAGGCCATGAATTTATTCCTGTGATTCATTACTGGCCAATAAAATTGTTGACC CCCFFFFFHHHHHJJJIIIIJJJJJJJJJJJJJJJJJJJJJIJJIIJJJJFHIJIJJJJJJJHGHGHHHFFFFFFFEFEEEEEDDDDDDDDDDCDCDDDDD NM:i:2 MD:Z:5A47T6 MC:Z:81M20S AS:i:50 XS:i:0 SA:Z:AF086833.2,2618,-,50M51S,44,3;
## SRR1972739.3 2131 AF086833.2 2618 44 50M51H = 2587 -81 GGTCAACAATTTTATTGGCCAGTAATGAATCACAGGAATAAATTCATGGC DDDDDCDCDDDDDDDDDDEEEEEFEFFFFFFFHHHGHGHJJJJJJJIJIH NM:i:3 MD:Z:20G2G10A15 MC:Z:81M20H AS:i:35 XS:i:0 SA:Z:AF086833.2,2531,+,60M41S,60,2;
## SRR1972739.3 131 AF086833.2 2587 60 81M20S = 2531 -57 GAATGATGAGAATAGATTTGTTACACTGGATGGTCAACAATTTTATTGGCCAGTAATGAATCACAGGAATAAATTCATGGCCTCTTTTTCAGTGAGCCATG CC@FFFFFHHHHHJJJJJJJJJJJJJJJJJJJJGIJJJJJJJJJJJJJIJJJJHHJIHJJIGIJIIJIJIJJJJJIIJIGHHHHFFFFFCCEEECCEDDC@ NM:i:5 MD:Z:6A18T25G2G10A15 MC:Z:60M41S AS:i:56 XS:i:0
## SRR1972739.4 77 * 0 0 * * 0 0 CTCTAATGAATCCCTTAAGCCAGGAGCGTTAGTGTCAAACGCAACGCCCCGGGTTCATGATCCTGGATGGCTGGTCGAACCAGGGAGATGTCACTCTAATA CCCFFFFFHHHHHJJJJJJJJJIGIJJJIJJJJIGIJJJJHIJJJJIIJJHFFCDEEEDDDDDDDDDDDDDDDBCDDDDDDDDDDDDDDDEDEDDDDDDED AS:i:0 XS:i:0
## samtools view: writing to standard output failed: Broken pipe
## samtools view: error closing standard output: -1
The alignment section contains data about the input sequence alignments. Each line typically represents a single alignment and contains information about how the alignment was generated. For example, which reference sequence and at what position was the input sequence aligned. Again, refer to the SAM specification for full details.
It’s important to note that we rarely work with alignments in this text-based format. Usually, we convert the SAM file into a binary or compressed representation which we refer to as the BAM format. This allows output from large sequencing runs to be stored efficiently. It also helps if we sort the alignments by their mapping position as this allows us to produce what’s called an ‘index’ file. As the name suggests, this file is analogous to the index in a book. It tells us which part of the file to read to fetch alignments from a particular region of the reference sequence. To sort the alignments and output a BAM file, we can use the sort command.
Sort and convert from SAM to BAM format:
samtools sort tutorial/SRR1972739.sam > tutorial/SRR1972739.bam
You should be aware that older versions of samtools required two commands to achieve the same outcome:
samtools view -b tutorial/SRR1972739.sam > tutorial/SRR1972739.bam
samtools sort tutorial/SRR1972739.bam > tutorial/SRR1972739.sorted.bam
Once we have the sorted BAM file, we can then create an index file. We do this using the index command. Notably, we do not have to specify an output file name for this command. The index file is automatically created and named by appending .bai to the input file name.
Build an index for the BAM file:
samtools index tutorial/SRR1972739.bam
List the BAM and index file:
ls tutorial/SRR1972739.*
## tutorial/SRR1972739.bam
## tutorial/SRR1972739.bam.bai
## tutorial/SRR1972739.sam
It’s often useful to calculate various statistics on alignment files. These statistics can be used to answer a range of questions, some needed for basic quality control and others which have biological significance. We will go through the most common statistics which can be calculated by the samtools commands. Alignment summary statistics can be reported using the idxstats command. The output is TAB-delimited with each line consisting of reference sequence name, sequence length, number of mapped reads, and number of unmapped reads.
samtools idxstats tutorial/SRR1972739.bam
## AF086833.2 18959 1549 0
## * 0 0 524
In addition to reporting the number of alignments by reference sequence, we can use the flagstat command to report the number of alignments for each FLAG value as described in the SAM format specification.
samtools flagstat tutorial/SRR1972739.bam
## 2073 + 0 in total (QC-passed reads + QC-failed reads)
## 2000 + 0 primary
## 0 + 0 secondary
## 73 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 1549 + 0 mapped (74.72% : N/A)
## 1476 + 0 primary mapped (73.80% : N/A)
## 2000 + 0 paired in sequencing
## 1000 + 0 read1
## 1000 + 0 read2
## 1466 + 0 properly paired (73.30% : N/A)
## 1476 + 0 with itself and mate mapped
## 0 + 0 singletons (0.00% : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
The flagstat report is helpful to assess the overall alignment performance. For example, a low mapping rate may indicate a significant problem with the library and will require further investigation. If you ever find yourself confused about what a FLAG value represents, you can use the flags command to convert between the value and it’s description.
Convert FLAG value to description:
samtools flags PAIRED
## 0x1 1 PAIRED
Convert description to FLAG value:
samtools flags 1
## 0x1 1 PAIRED
In many genomics applications it is important to compute both read coverage and read depth at each position or region:
To calculate read coverage, we can use the coverage command. The output contains some useful values like the number of covered bases (covbases), the percentage of covered bases (coverage), and the mean depth of coverage (meandepth).
samtools coverage tutorial/SRR1972739.bam
## #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
## AF086833.2 1 18959 1549 18545 97.8163 7.74941 38 59.9
To calculate read depth, we can use the depth command. The output is TAB-delimited with each line consisting of reference sequence name, coordinate, and number of mapped reads or alignments.
# Output the first few lines
samtools depth tutorial/SRR1972739.bam | head
## AF086833.2 114 1
## AF086833.2 115 1
## AF086833.2 116 1
## AF086833.2 117 1
## AF086833.2 118 1
## AF086833.2 119 1
## AF086833.2 120 1
## AF086833.2 121 1
## AF086833.2 122 1
## AF086833.2 123 2
For many genomics applications, the SAM file produced by most read alignment software will require some additional processing. You might have to remove unmapped reads, select only high-quality alignments, or use just a subset of regions. All of these tasks and more can be accomplished using the view command. There are a number of options available in this command, so it is worth examining the help information first:
samtools view --help
##
## Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
##
## Output options:
## -b, --bam Output BAM
## -C, --cram Output CRAM (requires -T)
## -1, --fast Use fast BAM compression (and default to --bam)
## -u, --uncompressed Uncompressed BAM output (and default to --bam)
## -h, --with-header Include header in SAM output
## -H, --header-only Print SAM header only (no alignments)
## --no-header Print SAM alignment records only [default]
## -c, --count Print only the count of matching records
## -o, --output FILE Write output to FILE [standard output]
## -U, --unoutput FILE, --output-unselected FILE
## Output reads not selected by filters to FILE
## -p, --unmap Set flag to UNMAP on reads not selected
## then write to output file.
## -P, --fetch-pairs Retrieve complete pairs even when outside of region
## Input options:
## -t, --fai-reference FILE FILE listing reference names and lengths
## -M, --use-index Use index and multi-region iterator for regions
## --region[s]-file FILE Use index to include only reads overlapping FILE
## -X, --customized-index Expect extra index file argument after <in.bam>
##
## Filtering options (Only include in output reads that...):
## -L, --target[s]-file FILE ...overlap (BED) regions in FILE
## -r, --read-group STR ...are in read group STR
## -R, --read-group-file FILE ...are in a read group listed in FILE
## -N, --qname-file FILE ...whose read name is listed in FILE
## -d, --tag STR1[:STR2] ...have a tag STR1 (with associated value STR2)
## -D, --tag-file STR:FILE ...have a tag STR whose value is listed in FILE
## -q, --min-MQ INT ...have mapping quality >= INT
## -l, --library STR ...are in library STR
## -m, --min-qlen INT ...cover >= INT query bases (as measured via CIGAR)
## -e, --expr STR ...match the filter expression STR
## -f, --require-flags FLAG ...have all of the FLAGs present
## -F, --excl[ude]-flags FLAG ...have none of the FLAGs present
## --rf, --incl-flags, --include-flags FLAG
## ...have some of the FLAGs present
## -G FLAG EXCLUDE reads with all of the FLAGs present
## --subsample FLOAT Keep only FLOAT fraction of templates/read pairs
## --subsample-seed INT Influence WHICH reads are kept in subsampling [0]
## -s INT.FRAC Same as --subsample 0.FRAC --subsample-seed INT
##
## Processing options:
## --add-flags FLAG Add FLAGs to reads
## --remove-flags FLAG Remove FLAGs from reads
## -x, --remove-tag STR
## Comma-separated read tags to strip (repeatable) [null]
## --keep-tag STR
## Comma-separated read tags to preserve (repeatable) [null].
## Equivalent to "-x ^STR"
## -B, --remove-B Collapse the backward CIGAR operation
##
## General options:
## -?, --help Print long help, including note about region specification
## -S Ignored (input format is auto-detected)
## --no-PG Do not add a PG line
## --input-fmt-option OPT[=VAL]
## Specify a single input file format option in the form
## of OPTION or OPTION=VALUE
## -O, --output-fmt FORMAT[,OPT[=VAL]]...
## Specify output format (SAM, BAM, CRAM)
## --output-fmt-option OPT[=VAL]
## Specify a single output file format option in the form
## of OPTION or OPTION=VALUE
## -T, --reference FILE
## Reference sequence FASTA FILE [null]
## -@, --threads INT
## Number of additional threads to use [0]
## --write-index
## Automatically index the output files [off]
## --verbosity INT
## Set level of verbosity
##
## Notes:
##
## 1. This command now auto-detects the input format (BAM/CRAM/SAM).
## Further control over the CRAM format can be specified by using the
## --output-fmt-option, e.g. to specify the number of sequences per slice
## and to use avoid reference based compression:
##
## samtools view -C --output-fmt-option seqs_per_slice=5000 \
## --output-fmt-option no_ref -o out.cram in.bam
##
## Options can also be specified as a comma separated list within the
## --output-fmt value too. For example this is equivalent to the above
##
## samtools view --output-fmt cram,seqs_per_slice=5000,no_ref \
## -o out.cram in.bam
##
## 2. The file supplied with `-t' is SPACE/TAB delimited with the first
## two fields of each line consisting of the reference name and the
## corresponding sequence length. The `.fai' file generated by
## `samtools faidx' is suitable for use as this file. This may be an
## empty file if reads are unaligned.
##
## 3. SAM->BAM conversion: samtools view -bT ref.fa in.sam.gz
##
## 4. BAM->SAM conversion: samtools view -h in.bam
##
## 5. A region should be presented in one of the following formats:
## `chr1', `chr2:1,000' and `chr3:1000-2,000'. When a region is
## specified, the input alignment file must be a sorted and indexed
## alignment (BAM/CRAM) file.
##
## 6. Option `-u' is preferred over `-b' when the output is piped to
## another samtools command.
##
## 7. Option `-M`/`--use-index` causes overlaps with `-L` BED file regions and
## command-line region arguments to be computed using the multi-region iterator
## and an index. This increases speed, omits duplicates, and outputs the reads
## as they are ordered in the input SAM/BAM/CRAM file.
##
## 8. Options `-L`/`--target[s]-file` and `--region[s]-file` may not be used
## together. `--region[s]-file FILE` is simply equivalent to `-M -L FILE`,
## so using both causes one of the specified BED files to be ignored.
As you can see, there are a lot of options. These allow the user to perform a huge variety of filtering and file manipulation. These options become especially powerful when combined with other samtools commands. Rather than run through an exhaustive tour of the view command, we will instead demonstrate the most typical use cases.
To filter alignments, we can use the -f and -F options. The first option will output alignments with a given FLAG value, and the second will output alignments without a given FLAG value. This can be quite counter-intuitive so we should look at some examples:
Output alignments where the read is unmapped:
# UNMAP (4)
# With the UNMAP flag (-f 4)
samtools view -f 4 tutorial/SRR1972739.bam > tutorial/unmapped.sam
Output alignments where the read is mapped (not unmapped):
# UNMAP (4)
# Without the UNMAP flag (-F 4)
samtools view -F 4 tutorial/SRR1972739.bam > tutorial/mapped.bam
Remember, that FLAG values can be summed together to represent a combination of FLAG values.
Output alignments where the reads is unmapped and is the first read of the pair:
# UNMAP (4)
# READ1 (64)
# UNMAP,READ1 (4 + 64 = 68)
# With the UNMAP,READ1 flag combination (-f 68)
samtools view -f 68 tutorial/SRR1972739.bam > tutorial/unmapped_read1.bam
Output alignments where the reads is mapped (not unmapped) and passed quality control (did not fail quality control):
# UNMAP (4)
# QCFAIL (512)
# UNMAP,QCFAIL (4 + 512 = 516)
# Without the UNMAP,QCFAIL flag combination (-F 516)
samtools view -F 516 tutorial/SRR1972739.bam > tutorial/mapped_qc.bam
To perform more specific filtering - and to make matters more confusing - you can select and disregard combinations of FLAG values. For example, to output alignments on the reverse strand we need to identify alignments without the unmapped FLAG value and with the reverse complemented FLAG value:
samtools view -c -F 4 -f 16 tutorial/SRR1972739.bam > tutorial/mapped_reverse.bam
Do not be discouraged if you find the FLAG values and the with or with not system of alignment output difficult to remember. Bioinformaticians the world over look this information up on a daily basis and there is no shame in searching online for a solution or reference. Aside from using FLAG values to output specific alignments, we can also choose based on the mapping quality. We can use the -q option to output alignments with mapping quality greater than or equal to a desired value:
# mapping quality >= 10
samtools view -b -q 10 tutorial/SRR1972739.bam > tutorial/SRR1972739.mapq.bam
Lastly, we can also output alignments from a specific region of the reference sequence. This is useful when you may want to focus on alignments which cover a particular region of the genome, like a specific gene body. To output alignments for a given region, simply specify the region on the command line with the required format (NAME:START-END).
# NAME = AF086833
# START = 1
# END = 1000
samtools view -b tutorial/SRR1972739.bam AF086833.2:1-1000 > tutorial/SRR1972739.region.bam
Finally, Samtools also provides a text-based genome browser to visually inspect the alignments:
# display as text (-d T)
# go to this position (-p AF086833.2:9972)
samtools tview -d T -p AF086833.2:9972 tutorial/SRR1972739.bam tutorial/AF086833.fasta
## 9981 9991 10001 10011 10021 10031 10041
## AGTACTATTTCAGGGTAGTCCAATTAGTGGCACGTCTTTTAGCTGTATATCAGTCGCCCCTGAGATACGCCACAAAAGTG
## .A............A.......GC.....A...................C....T.........................
## ,a,,,,,,,, .....A...................C....T.........................
## .A............A....... ,,,,,a,,,,,,,,,,,,,,,,,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,
## .A............A.......GC.....A..................
## .A............A.......GC.....A...................C....T....
## .A............A.......GC.....A...................C....T.....................
## .A............A.......GC.....A.................
## .A............A.......GC.....A...................C....T.........................
## .A............A.......GC.....A...................C....T.......................
## ,a,,,,,,,,,,,,a,,,,,,,gc,,,,,a,,,,,,,,,,,,,,,,,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,
## A............A.......GC.....A...................C....T.........................
## A............A.......GC.....A...................C....T.........................
## .......A.......GC.....A...................C....T.........................
## ,,,a,,,,,,,,,,,,,,,,,,,c,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,
## .......C....T.........................
## .....C....T.........................
The exercises below are designed to strengthen your knowledge of using Samtools to manipulate SAM files. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.
Create a directory to store the output files from each exercise:
mkdir exercises
mkdir exercises/cholera
mkdir exercises/zika
mkdir exercises/tuberculosis
Cholera is an acute diarrhoeal infection caused by eating or drinking food or water that is contaminated with the bacterium Vibrio cholerae. Cholera remains a global threat to public health and is an indicator of inequity and lack of social development. Researchers have estimated that every year, there are 1.3 to 4.0 million cases of cholera, and 21,000 to 143,000 deaths worldwide due to the infection.
efetch command to download the genome for the Vibrio cholerae bacteria. Note, the genome is split into two entries on the NCBI database: AP014524 and AP014525. You can either download each seperately and then concatenate the files or provide both accession numbers to the efetch command.efetch -db nuccore -id "AP014524,AP014525" -format fasta > exercises/cholera/MS6.fasta
fastq-dump command to download 1000 single-end reads from the public Vibrio cholerae sequencing run with accession number SRR16345277.fastq-dump -X 1000 -O exercises/cholera SRR16345277
## Read 1000 spots for SRR16345277
## Written 1000 spots for SRR16345277
bwa index exercises/cholera/MS6.fasta
## [bwa_index] Pack FASTA... 0.03 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.53 seconds elapse.
## [bwa_index] Update BWT... 0.02 sec
## [bwa_index] Pack forward-only FASTA... 0.01 sec
## [bwa_index] Construct SA from BWT and Occ... 0.17 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index exercises/cholera/MS6.fasta
## [main] Real time: 0.768 sec; CPU: 0.783 sec
bwa mem exercises/cholera/MS6.fasta exercises/cholera/SRR16345277.fastq > exercises/cholera/SRR16345277.sam
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 1000 sequences (100000 bp)...
## [M::mem_process_seqs] Processed 1000 reads in 0.023 CPU sec, 0.023 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/cholera/MS6.fasta exercises/cholera/SRR16345277.fastq
## [main] Real time: 0.028 sec; CPU: 0.030 sec
samtools sort exercises/cholera/SRR16345277.sam > exercises/cholera/SRR16345277.bam
samtools index exercises/cholera/SRR16345277.bam
samtools flagstat exercises/cholera/SRR16345277.bam
## 1006 + 0 in total (QC-passed reads + QC-failed reads)
## 1000 + 0 primary
## 0 + 0 secondary
## 6 + 0 supplementary
## 0 + 0 duplicates
## 0 + 0 primary duplicates
## 975 + 0 mapped (96.92% : N/A)
## 969 + 0 primary mapped (96.90% : N/A)
## 0 + 0 paired in sequencing
## 0 + 0 read1
## 0 + 0 read2
## 0 + 0 properly paired (N/A : N/A)
## 0 + 0 with itself and mate mapped
## 0 + 0 singletons (N/A : N/A)
## 0 + 0 with mate mapped to a different chr
## 0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools idxstats exercises/cholera/SRR16345277.bam
## AP014524.1 2936971 714 0
## AP014525.1 1093973 261 0
## * 0 0 31
# UNMAP (4)
samtools view -b -F 4 exercises/cholera/SRR16345277.bam > exercises/cholera/SRR16345277.mapped.bam
Zika virus is primarily transmitted by the bite of an infected mosquito from the Aedes genus, mainly Aedes aegypti, in tropical and subtropical regions. Aedes mosquitoes usually bite during the day, peaking during early morning and late afternoon/evening. This is the same mosquito that transmits dengue, chikungunya and yellow fever.
efetch command from EDirect to download and save the Zika reference genome (AY632535).efetch -db nuccore -id AY632535 -format fasta > exercises/zika/AY632535.fasta
esearch and efetch commands to download and save the run information from a public Zika sequencing project (PRJNA609110).esearch -db sra -query PRJNA609110 | efetch -format runinfo > exercises/zika/runinfo.csv
grep "SINGLE" exercises/zika/runinfo.csv | cut -d "," -f 1 > exercises/zika/runids.txt
fastq-dump command to download 10,000 single-end reads from each of the runs selected in Q3. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3.while read RUN; do
fastq-dump -X 10000 -O exercises/zika ${RUN}
done < exercises/zika/runids.txt
## Read 10000 spots for SRR14467421
## Written 10000 spots for SRR14467421
## Read 10000 spots for SRR14467422
## Written 10000 spots for SRR14467422
bwa index exercises/zika/AY632535.fasta
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa index exercises/zika/AY632535.fasta
## [main] Real time: 0.004 sec; CPU: 0.021 sec
while read RUN; do
bwa mem exercises/zika/AY632535.fasta exercises/zika/${RUN}.fastq > exercises/zika/${RUN}.sam
done < exercises/zika/runids.txt
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 10000 sequences (1480034 bp)...
## [M::mem_process_seqs] Processed 10000 reads in 0.585 CPU sec, 0.586 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/zika/AY632535.fasta exercises/zika/SRR14467421.fastq
## [main] Real time: 0.608 sec; CPU: 0.605 sec
## [M::bwa_idx_load_from_disk] read 0 ALT contigs
## [M::process] read 10000 sequences (1461417 bp)...
## [M::mem_process_seqs] Processed 10000 reads in 0.593 CPU sec, 0.593 real sec
## [main] Version: 0.7.17-r1188
## [main] CMD: bwa mem exercises/zika/AY632535.fasta exercises/zika/SRR14467422.fastq
## [main] Real time: 0.609 sec; CPU: 0.611 sec
while read RUN; do
samtools sort exercises/zika/${RUN}.sam > exercises/zika/${RUN}.bam
done < exercises/zika/runids.txt
while read RUN; do
samtools index exercises/zika/${RUN}.bam
done < exercises/zika/runids.txt
merge command to combine the alignments from all of the BAM files into a single BAM file.samtools merge exercises/zika/merge.bam exercises/zika/SRR*.bam
# Percent covered: 5.151%
# Mean coverage: 238x
samtools coverage -m exercises/zika/merge.bam
## AY632535.2 (10.8Kbp)
## > 90.00% │█ │ Number of reads: 23138
## > 80.00% │█ │
## > 70.00% │█ │ Covered bases: 556bp
## > 60.00% │█ │ Percent covered: 5.151%
## > 50.00% │██ │ Mean coverage: 238x
## > 40.00% │██ │ Mean baseQ: 34.1
## > 30.00% │██ │ Mean mapQ: 56.7
## > 20.00% │██ ▃ │
## > 10.00% │██ █ ▄ │ Histo bin width: 269bp
## > 0.00% │██ █ ▆█ │ Histo max bin: 100%
## 1 2.7K 5.4K 8.1K 10.8K
Tuberculosis (TB) is caused by bacteria (Mycobacterium tuberculosis) and it most often affects the lungs. TB is spread through the air when people with lung TB cough, sneeze or spit. A person needs to inhale only a few germs to become infected. Every year, 10 million people fall ill with tuberculosis (TB). Despite being a preventable and curable disease, 1.5 million people die from TB each year – making it the world’s top infectious killer.
efetch command from EDirect to download and save the Tuberculosis reference genome (AL123456).efetch -db nuccore -id AL123456 -format fasta > exercises/tuberculosis/AL123456.fasta
esearch and efetch commands to download and save the run information from a public Tuberculosis sequencing project (PRJNA575883).esearch -db sra -query PRJNA575883 | efetch -format runinfo > exercises/tuberculosis/runinfo.csv
cat exercises/tuberculosis/runinfo.csv | grep "SRA1348739" | cut -d "," -f 1 > exercises/tuberculosis/runids.txt
fastq-dump command to download 1,000 paired-end reads from each of the runs selected in Q3. Instead of manually creating the command for each run accession number, use a for loop to read each run accession number from the file created in Q3.while read RUN; do
fastq-dump -X 1000 -O exercises/tuberculosis --split-files ${RUN}
done < exercises/tuberculosis/runids.txt
## Read 1000 spots for SRR17333997
## Written 1000 spots for SRR17333997
## Read 1000 spots for SRR17333996
## Written 1000 spots for SRR17333996
## Read 1000 spots for SRR17333995
## Written 1000 spots for SRR17333995
## Read 1000 spots for SRR17333994
## Written 1000 spots for SRR17333994
## Read 1000 spots for SRR17333993
## Written 1000 spots for SRR17333993
## Read 1000 spots for SRR17333992
## Written 1000 spots for SRR17333992
## Read 1000 spots for SRR17333991
## Written 1000 spots for SRR17333991
## Read 1000 spots for SRR17333990
## Written 1000 spots for SRR17333990
## Read 1000 spots for SRR17333989
## Written 1000 spots for SRR17333989
## Read 1000 spots for SRR17333998
## Written 1000 spots for SRR17333998
bowtie2-build exercises/tuberculosis/AL123456.fasta exercises/tuberculosis/AL123456
## Settings:
## Output files: "exercises/tuberculosis/AL123456.*.bt2"
## Line rate: 6 (line is 64 bytes)
## Lines per side: 1 (side is 64 bytes)
## Offset rate: 4 (one in 16)
## FTable chars: 10
## Strings: unpacked
## Max bucket size: default
## Max bucket size, sqrt multiplier: default
## Max bucket size, len divisor: 4
## Difference-cover sample period: 1024
## Endianness: little
## Actual local endianness: little
## Sanity checking: disabled
## Assertions: disabled
## Random seed: 0
## Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
## exercises/tuberculosis/AL123456.fasta
## Building a SMALL index
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## bmax according to bmaxDivN setting: 1102883
## Using parameters --bmax 827163 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 827163 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 6; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 1; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 630218 (target: 827162)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 7
## Reserving size (827163) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 645384 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 645385 for bucket 1
## Getting block 2 of 7
## Reserving size (827163) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 715092 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 715093 for bucket 2
## Getting block 3 of 7
## Reserving size (827163) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 484041 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 484042 for bucket 3
## Getting block 4 of 7
## Reserving size (827163) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 491241 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 491242 for bucket 4
## Getting block 5 of 7
## Reserving size (827163) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 776428 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 776429 for bucket 5
## Getting block 6 of 7
## Reserving size (827163) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 776107 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 776108 for bucket 6
## Getting block 7 of 7
## Reserving size (827163) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 523233 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 523234 for bucket 7
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 758552
## fchr[G]: 2208550
## fchr[T]: 3653164
## fchr[$]: 4411532
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5665053 bytes to primary EBWT file: exercises/tuberculosis/AL123456.1.bt2.tmp
## Wrote 1102888 bytes to secondary EBWT file: exercises/tuberculosis/AL123456.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 4411532
## bwtLen: 4411533
## sz: 1102883
## bwtSz: 1102884
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 275721
## offsSz: 1102884
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 22977
## numLines: 22977
## ebwtTotLen: 1470528
## ebwtTotSz: 1470528
## color: 0
## reverse: 0
## Total time for call to driver() for forward index: 00:00:02
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:00
## Time to reverse reference sequence: 00:00:00
## bmax according to bmaxDivN setting: 1102883
## Using parameters --bmax 827163 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 827163 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Split 1, merged 5; iterating...
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 551441 (target: 827162)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering Ebwt loop
## Getting block 1 of 8
## Reserving size (827163) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 523071 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 523072 for bucket 1
## Getting block 2 of 8
## Reserving size (827163) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 549662 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 549663 for bucket 2
## Getting block 3 of 8
## Reserving size (827163) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 532754 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 532755 for bucket 3
## Getting block 4 of 8
## Reserving size (827163) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 660062 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 660063 for bucket 4
## Getting block 5 of 8
## Reserving size (827163) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 692173 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 692174 for bucket 5
## Getting block 6 of 8
## Reserving size (827163) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 618745 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 618746 for bucket 6
## Getting block 7 of 8
## Reserving size (827163) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 702923 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 702924 for bucket 7
## Getting block 8 of 8
## Reserving size (827163) for bucket 8
## Calculating Z arrays for bucket 8
## Entering block accumulator loop for bucket 8:
## bucket 8: 10%
## bucket 8: 20%
## bucket 8: 30%
## bucket 8: 40%
## bucket 8: 50%
## bucket 8: 60%
## bucket 8: 70%
## bucket 8: 80%
## bucket 8: 90%
## bucket 8: 100%
## Sorting block of length 132135 for bucket 8
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 132136 for bucket 8
## Exited Ebwt loop
## fchr[A]: 0
## fchr[C]: 758552
## fchr[G]: 2208550
## fchr[T]: 3653164
## fchr[$]: 4411532
## Exiting Ebwt::buildToDisk()
## Returning from initFromVector
## Wrote 5665053 bytes to primary EBWT file: exercises/tuberculosis/AL123456.rev.1.bt2.tmp
## Wrote 1102888 bytes to secondary EBWT file: exercises/tuberculosis/AL123456.rev.2.bt2.tmp
## Re-opening _in1 and _in2 as input streams
## Returning from Ebwt constructor
## Headers:
## len: 4411532
## bwtLen: 4411533
## sz: 1102883
## bwtSz: 1102884
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 20
## eftabSz: 80
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 275721
## offsSz: 1102884
## lineSz: 64
## sideSz: 64
## sideBwtSz: 48
## sideBwtLen: 192
## numSides: 22977
## numLines: 22977
## ebwtTotLen: 1470528
## ebwtTotSz: 1470528
## color: 0
## reverse: 1
## Total time for backward call to driver() for mirror index: 00:00:01
## Renaming exercises/tuberculosis/AL123456.3.bt2.tmp to exercises/tuberculosis/AL123456.3.bt2
## Renaming exercises/tuberculosis/AL123456.4.bt2.tmp to exercises/tuberculosis/AL123456.4.bt2
## Renaming exercises/tuberculosis/AL123456.1.bt2.tmp to exercises/tuberculosis/AL123456.1.bt2
## Renaming exercises/tuberculosis/AL123456.2.bt2.tmp to exercises/tuberculosis/AL123456.2.bt2
## Renaming exercises/tuberculosis/AL123456.rev.1.bt2.tmp to exercises/tuberculosis/AL123456.rev.1.bt2
## Renaming exercises/tuberculosis/AL123456.rev.2.bt2.tmp to exercises/tuberculosis/AL123456.rev.2.bt2
while read RUN; do
bowtie2 -x exercises/tuberculosis/AL123456 -1 exercises/tuberculosis/${RUN}_1.fastq -2 exercises/tuberculosis/${RUN}_2.fastq > exercises/tuberculosis/${RUN}.sam
samtools sort exercises/tuberculosis/${RUN}.sam > exercises/tuberculosis/${RUN}.bam
samtools index exercises/tuberculosis/${RUN}.bam
done < exercises/tuberculosis/runids.txt
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 157 (15.70%) aligned concordantly 0 times
## 822 (82.20%) aligned concordantly exactly 1 time
## 21 (2.10%) aligned concordantly >1 times
## ----
## 157 pairs aligned concordantly 0 times; of these:
## 65 (41.40%) aligned discordantly 1 time
## ----
## 92 pairs aligned 0 times concordantly or discordantly; of these:
## 184 mates make up the pairs; of these:
## 145 (78.80%) aligned 0 times
## 34 (18.48%) aligned exactly 1 time
## 5 (2.72%) aligned >1 times
## 92.75% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 119 (11.90%) aligned concordantly 0 times
## 865 (86.50%) aligned concordantly exactly 1 time
## 16 (1.60%) aligned concordantly >1 times
## ----
## 119 pairs aligned concordantly 0 times; of these:
## 43 (36.13%) aligned discordantly 1 time
## ----
## 76 pairs aligned 0 times concordantly or discordantly; of these:
## 152 mates make up the pairs; of these:
## 111 (73.03%) aligned 0 times
## 40 (26.32%) aligned exactly 1 time
## 1 (0.66%) aligned >1 times
## 94.45% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 165 (16.50%) aligned concordantly 0 times
## 822 (82.20%) aligned concordantly exactly 1 time
## 13 (1.30%) aligned concordantly >1 times
## ----
## 165 pairs aligned concordantly 0 times; of these:
## 49 (29.70%) aligned discordantly 1 time
## ----
## 116 pairs aligned 0 times concordantly or discordantly; of these:
## 232 mates make up the pairs; of these:
## 164 (70.69%) aligned 0 times
## 67 (28.88%) aligned exactly 1 time
## 1 (0.43%) aligned >1 times
## 91.80% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 158 (15.80%) aligned concordantly 0 times
## 824 (82.40%) aligned concordantly exactly 1 time
## 18 (1.80%) aligned concordantly >1 times
## ----
## 158 pairs aligned concordantly 0 times; of these:
## 42 (26.58%) aligned discordantly 1 time
## ----
## 116 pairs aligned 0 times concordantly or discordantly; of these:
## 232 mates make up the pairs; of these:
## 158 (68.10%) aligned 0 times
## 73 (31.47%) aligned exactly 1 time
## 1 (0.43%) aligned >1 times
## 92.10% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 159 (15.90%) aligned concordantly 0 times
## 824 (82.40%) aligned concordantly exactly 1 time
## 17 (1.70%) aligned concordantly >1 times
## ----
## 159 pairs aligned concordantly 0 times; of these:
## 58 (36.48%) aligned discordantly 1 time
## ----
## 101 pairs aligned 0 times concordantly or discordantly; of these:
## 202 mates make up the pairs; of these:
## 158 (78.22%) aligned 0 times
## 43 (21.29%) aligned exactly 1 time
## 1 (0.50%) aligned >1 times
## 92.10% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 139 (13.90%) aligned concordantly 0 times
## 851 (85.10%) aligned concordantly exactly 1 time
## 10 (1.00%) aligned concordantly >1 times
## ----
## 139 pairs aligned concordantly 0 times; of these:
## 41 (29.50%) aligned discordantly 1 time
## ----
## 98 pairs aligned 0 times concordantly or discordantly; of these:
## 196 mates make up the pairs; of these:
## 136 (69.39%) aligned 0 times
## 56 (28.57%) aligned exactly 1 time
## 4 (2.04%) aligned >1 times
## 93.20% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 187 (18.70%) aligned concordantly 0 times
## 807 (80.70%) aligned concordantly exactly 1 time
## 6 (0.60%) aligned concordantly >1 times
## ----
## 187 pairs aligned concordantly 0 times; of these:
## 86 (45.99%) aligned discordantly 1 time
## ----
## 101 pairs aligned 0 times concordantly or discordantly; of these:
## 202 mates make up the pairs; of these:
## 158 (78.22%) aligned 0 times
## 38 (18.81%) aligned exactly 1 time
## 6 (2.97%) aligned >1 times
## 92.10% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 195 (19.50%) aligned concordantly 0 times
## 790 (79.00%) aligned concordantly exactly 1 time
## 15 (1.50%) aligned concordantly >1 times
## ----
## 195 pairs aligned concordantly 0 times; of these:
## 53 (27.18%) aligned discordantly 1 time
## ----
## 142 pairs aligned 0 times concordantly or discordantly; of these:
## 284 mates make up the pairs; of these:
## 208 (73.24%) aligned 0 times
## 75 (26.41%) aligned exactly 1 time
## 1 (0.35%) aligned >1 times
## 89.60% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 215 (21.50%) aligned concordantly 0 times
## 775 (77.50%) aligned concordantly exactly 1 time
## 10 (1.00%) aligned concordantly >1 times
## ----
## 215 pairs aligned concordantly 0 times; of these:
## 83 (38.60%) aligned discordantly 1 time
## ----
## 132 pairs aligned 0 times concordantly or discordantly; of these:
## 264 mates make up the pairs; of these:
## 229 (86.74%) aligned 0 times
## 35 (13.26%) aligned exactly 1 time
## 0 (0.00%) aligned >1 times
## 88.55% overall alignment rate
## 1000 reads; of these:
## 1000 (100.00%) were paired; of these:
## 254 (25.40%) aligned concordantly 0 times
## 737 (73.70%) aligned concordantly exactly 1 time
## 9 (0.90%) aligned concordantly >1 times
## ----
## 254 pairs aligned concordantly 0 times; of these:
## 131 (51.57%) aligned discordantly 1 time
## ----
## 123 pairs aligned 0 times concordantly or discordantly; of these:
## 246 mates make up the pairs; of these:
## 227 (92.28%) aligned 0 times
## 15 (6.10%) aligned exactly 1 time
## 4 (1.63%) aligned >1 times
## 88.65% overall alignment rate
while read RUN; do
# Output BAM (-b)
# FLAGs present (PROPER_PAIR = 2)
samtools view -b -f 2 exercises/tuberculosis/${RUN}.bam > exercises/tuberculosis/${RUN}.PROPER_PAIR.bam
samtools index exercises/tuberculosis/${RUN}.PROPER_PAIR.bam
done < exercises/tuberculosis/runids.txt
samtools depth -H exercises/tuberculosis/*.PROPER_PAIR.bam > exercises/tuberculosis/depth.txt